SemanticScuttle - klotz.me » Tags: google deepmind

Tags: google deepmind*

This page provides GGUF quantized versions of DiffusionGemma 26B A4B-it, a multimodal model from Google DeepMind based on the Gemma 4 architecture. The model employs discrete text diffusion through block-autoregressive multi-canvas sampling to achieve significantly faster decoding speeds than standard autoregressive models. It is capable of processing interleaved inputs consisting of text, images with variable resolutions and aspect ratios, and video content for generating textual outputs.
Key topics:
- Mixture-of-Experts architecture with 3.8 billion active parameters.
- High-speed generation through parallel denoising of token blocks.
- Multimodal input support including image and video understanding.
- Extensive context window capability up to 256K tokens.
- Integrated reasoning modes for step-by-step thought processes.

2026-06-12 Tags: diffusion gemma, unsloth, gguf, google deepmind, multimodal model, mixture of experts, llm by klotz

Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native audio that runs on a 16 GB laptop

Google DeepMind has released Gemma 4 12B, a dense multimodal model with an encoder-free architecture. Unlike previous iterations that used separate vision and audio encoders, this model lets these modalities flow directly into the LLM backbone. The streamlined design reduces latency and memory overhead, allowing the model to perform agentic reasoning tasks on consumer laptops with as little as 16 GB of VRAM while approaching the performance of much larger models such as the 26B MoE variant.

- Unified decoder-only architecture for text, image, video, and native audio input.
- Encoder-free design using a 35M vision embedder and direct raw audio wave projection.
- Optimized to run locally on Apple Silicon Macs and consumer GPU laptops.
- Released under an Apache 2.0 license with support for llama.cpp, MLX, vLLM, and Ollama.

2026-06-06 Tags: google deepmind, gemma 4, multimodal, audio, large language model, machine learning, agents by klotz

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Google has introduced Gemma 4 12B, a mid-sized multimodal model designed to bring agentic intelligence directly to consumer laptops. This model bridges the gap between smaller edge models and larger Mixture of Experts versions by offering high performance with a significantly reduced memory footprint. A key innovation is its encoder-free architecture, which allows vision and audio inputs to flow directly into the language model backbone rather than relying on separate, latency-inducing encoders.
Main topics:
- Novel unified architecture without multimodal encoders.
- Native support for direct audio and vision input processing.
- Optimized for local execution on hardware with 16 GB of RAM.
- Reasoning performance nearing much larger 26B models.
- Released under an Apache 2.0 license.
- Integrated Multi-Token Prediction drafters to reduce latency.

2026-06-03 Tags: gemma 4 12b, google deepmind, multimodal model, encoder-free architecture, local ai, agentic workflows, developer tools, apache 2.0 by klotz

I finally found an open-source local LLM that actually competes with cloud AI

The author explores the utility of Google DeepMind's Gemma 4 as a powerful option for running large language models locally on consumer hardware. By testing the E4B variant using tools like LM Studio and llama.cpp, they demonstrate how open-weight models can handle multimodal tasks including text, image analysis, and audio processing with impressive precision and privacy.

2026-05-12 Tags: gemma 4, google deepmind, local llm, multimodal, llama.cpp by klotz

Accelerating Gemma 4: faster inference with multi-token prediction

Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family to significantly accelerate inference speeds. By utilizing a specialized speculative decoding architecture, these drafters can deliver up to a 3x speedup without compromising output quality or reasoning capabilities. This technology addresses memory-bandwidth bottlenecks by allowing a lightweight drafter to predict multiple future tokens that are then verified in parallel by the larger target model.
Key points:
* Improved responsiveness for real-time chat, voice applications, and agentic workflows.
* Faster local development on personal computers and consumer GPUs.
* Enhanced performance and battery efficiency on edge devices.
* Architectural optimizations including KV cache sharing and activation utilization.
* Available now under the Apache 2.0 license via Hugging Face and Kaggle.
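
The draft-then-verify idea described above can be illustrated with a minimal sketch of greedy speculative decoding. This is not Google's MTP drafter implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for a large target model and a lightweight drafter, and the verification loop here runs sequentially, whereas a real system would check all drafted positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Hypothetical toy 'target model': a deterministic next-token rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Hypothetical toy 'drafter': agrees with the target except when
    the prefix length is a multiple of 4, where it is deliberately wrong."""
    guess = target_next(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one model call per new token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (sequence, verification rounds)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        rounds += 1
        # 1. The drafter proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each proposed position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token was accepted: the target adds one more.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec, rounds = speculative_decode(target_next, draft_next, prompt, 20)
    print(spec == greedy_decode(target_next, prompt, 20), rounds)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding of the target; the potential saving comes from needing fewer target verification rounds than generated tokens.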

2026-05-05 Tags: gemma 4, multi-token prediction, mtp, speculative decoding, inference speed, google deepmind, llm efficiency by klotz

Google DeepMind’s Research Lets an LLM Rewrite Its Own Game Theory Algorithms — And It Outperformed the Experts

Google DeepMind has introduced AlphaEvolve, an LLM-powered evolutionary coding agent that automates the design of algorithms for Multi-Agent Reinforcement Learning (MARL) in imperfect-information games. Using Gemini 2.5 Pro to mutate Python source code, the system discovered two novel algorithms: VAD-CFR and SHOR-PSRO. These evolved algorithms matched or surpassed state-of-the-art hand-designed baselines in various scenarios, including poker and Liar's Dice. The research highlights the ability of automated search to discover non-intuitive mechanisms, such as volatility-adaptive discounting and hybrid meta-solvers, which generalize effectively to larger, unseen games, suggesting that LLMs can evolve complex algorithmic logic more efficiently than manual human iteration.

2026-04-04 Tags: google deepmind, alphaevolve, llm, game theory, multi-agent reinforcement learning, marl, vad-cfr, shor-psro, gemini 2.5 pro by klotz

Evaluating Collective Behaviour of Hundreds of LLM Agents

>As autonomous agents powered by LLM are increasingly deployed in society, understanding their collective behaviour in social dilemmas becomes critical. We introduce an evaluation framework where LLMs generate strategies encoded as algorithms, enabling inspection prior to deployment and scaling to populations of hundreds of agents—substantially larger than in previous work. We find that more recent models tend to produce worse societal outcomes compared to older models when agents prioritise individual gain over collective benefits. Using cultural evolution to model user selection of agents, our simulations reveal a significant risk of convergence to poor societal equilibria, particularly when the relative benefit of cooperation diminishes and population sizes increase. We release our code as an evaluation suite for developers to assess the emergent collective behaviour of their models

2026-02-21 Tags: llm, social dilemma, emergent behaviour, richard willis, jianing zhao, yali du, joel z. leibo, king’s college london, google deepmind by klotz

Google DeepMind Finds a Fundamental Bug in RAG: Embedding Limits Break Retrieval at Scale

Google DeepMind research reveals a fundamental architectural limitation in Retrieval-Augmented Generation (RAG) systems related to fixed-size embeddings. The research demonstrates that retrieval performance degrades as database size increases, with theoretical limits based on embedding dimensionality. They introduce the LIMIT benchmark to empirically test these limitations and suggest alternatives like cross-encoders, multi-vector models, and sparse models.
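
As a toy illustration of how embedding dimensionality can cap what a retriever is able to return, consider the extreme case of one-dimensional embeddings scored by dot product. This sketch is not the LIMIT benchmark or the paper's method; the document values and the helper names `top1` and `reachable_top1` are hypothetical. With a scalar query, the best-scoring document is always the one with the largest or the smallest embedding value, so any document in between can never be ranked first.

```python
import random
from typing import List, Set


def top1(query: float, docs: List[float]) -> int:
    """Index of the document with the highest dot-product score (1-D case)."""
    return max(range(len(docs)), key=lambda i: query * docs[i])


def reachable_top1(docs: List[float], n_queries: int = 10_000, seed: int = 0) -> Set[int]:
    """Collect every document index that wins for at least one random query."""
    rng = random.Random(seed)
    return {top1(rng.uniform(-1.0, 1.0), docs) for _ in range(n_queries)}


if __name__ == "__main__":
    docs = [-0.7, 0.1, 0.9]  # hypothetical 1-D document embeddings
    # Document 1 (value 0.1) never appears, whatever the query.
    print(sorted(reachable_top1(docs)))
```

Raising the dimension enlarges the set of rankings a single-vector retriever can express, but the set stays bounded, which is the motivation for the alternatives the summary lists (cross-encoders, multi-vector models, sparse models).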

2025-09-05 Tags: rag, retrieval-augmented generation, embeddings, google deepmind, limit benchmark, ai, machine learning, sparse models, cross-encoders, multi-vector models by klotz

Google DeepMind Introduces Differentiable Cache Augmentation: A Coprocessor-Enhanced Approach to Boost LLM Reasoning and Efficiency

Researchers from Google DeepMind have developed Differentiable Cache Augmentation, a method that uses a coprocessor to augment an LLM's key-value cache with latent embeddings, enhancing reasoning capabilities without increasing the computational burden.

"The methodology revolves around a three-stage process. First, the frozen LLM generates a kv-cache from an input sequence, encapsulating its internal representation. This kv-cache is passed to the coprocessor, which processes it with additional trainable soft tokens. Not tied to specific words, these tokens act as abstract prompts for generating latent embeddings. Once processed, the augmented kv-cache is fed back into the LLM, enabling it to generate contextually enriched outputs. This asynchronous operation ensures the coprocessor’s enhancements are applied efficiently without delaying the LLM’s primary functions. Training the coprocessor is conducted using a language modeling loss, focusing solely on its parameters while preserving the integrity of the frozen LLM. This targeted approach allows for scalable and effective optimization."

2024-12-28 Tags: google deepmind, differentiable cache augmentation by klotz
